Abstract: Computational models of the human stereo system can provide insight into general
information processing constraints that apply to any stereo system, either artificial or biological.
In 1977, Marr and Poggio proposed one such computational model, which was characterized as
matching certain feature points in difference-of-Gaussian filtered images, and using the information
obtained by matching coarser resolution representations to restrict the search space for matching
finer resolution representations. An implementation of the algorithm and its testing on a range
of images was reported in 1980. Since then a number of psychophysical experiments have
suggested possible refinements to the model and modifications to the algorithm. As well, recent
computational experiments applying the algorithm to a variety of natural images, especially aerial
photographs, have led to a number of modifications. In this article, we present a version of the
Marr-Poggio-Grimson algorithm that embodies these modifications and illustrate its performance
on a series of natural images.
1.  Introduction


The ability of a sensory system to passively sense the three dimensional structure of its
surrounding environment is frequently a necessary precursor to efficient interactions with that
environment, both for biological and artificial systems. A common method for performing this
sensing is through stereo vision, and in fact, the human stereo system is remarkably adept at this
computation, under a wide variety of conditions. Stereo vision can be characterized by three steps:
(1) The point in one image corresponding to the projection of a point on a surface is located.
(2) The point in the other image corresponding to the projection of the same surface point is located.
(3) The difference in projection of the corresponding points is used, together with estimates of the
parameters of the imaging geometry (which may be determined solely from the correspondences),
to determine a measure of the distance to the surface point. While all three steps are important
to the process, the second stage has usually been considered the critical one. To deal with this
correspondence problem, and its concomitant problem of avoiding false targets in determining
the correct correspondence or match, concern has centered on appropriate representations for
matching, and on constraints on the matching process that will ensure the correct correspondence
is chosen.

While psychophysical evidence concerning the nature of the human stereo system has
been accumulating for some time, recently attention also has been focused on computational
investigations of the system. One goal of these investigations has been to consider models of the
information processing aspects of the system, independent to a large extent of the specifics of the
mechanism that performs the computation. While such models are of importance in understanding
the processing of the human system, this relative independence of the algorithm used by the
human system and its specific implementation in neural units also suggests that such algorithms
may have implications for non-biological applications.

In 1977, Marr and Poggio proposed a feature-point based model of aspects of human
stereopsis [Marr and Poggio, 1979]. A computer implementation of their algorithm was then
developed and tested [Grimson, 1981a, b]. Initially, the implementation was evaluated on standard
psychological test images, in particular, random dot stereograms [Julesz, 1960, 1971]. The intent of
this investigation was to demonstrate the adequacy of the Marr-Poggio model for such patterns,
and to demonstrate the consistency of the model with known aspects of human stereo perception,
including situations in which the system fails. The implementation was also tested on a number of
natural images, under a variety of illumination conditions and with a variety of different surface
materials. Since the original presentation of the Marr-Poggio model, a number of additional
psychophysical predictions of the model have been tested, and consequently, several modifications
and improvements have been proposed [e.g. Mayhew and Frisby, 1981; Frisby and Mayhew, 1980;
Mowforth, Mayhew and Frisby, 1981; Schumer and Julesz, 1982].

While examining the psychophysical aspects of the model is clearly of importance for
perceptual modelling, computational experiments with the algorithm can also provide insights
into the information processing aspects of the model. Such experiments are also of importance
when considering applications of the algorithm to domains other than modelling of the human
system, as are non-biologically based studies of feature point stereo vision systems [e.g. Arnold and
Binford, 1980; Baker, 1982; Baker and Binford, 1981; Barnard and Thompson, 1980; Moravec,
1977, 1980; Ohta and Kanade, 1983 (see also the technique of Kass, 1983, 1984, which may also
be applicable to feature point stereo)]. Following the original testing of the Marr-Poggio-Grimson
algorithm, as reported previously [Grimson, 1981a, b, with some modifications proposed in Marr
and Poggio, 1980], extensive additional computational experiments with the algorithm have been
performed, especially on natural images. These experiments have led to a number of modifications
to the original algorithm, as well as elucidating points that require additional attention. While no
inference is made as to the relevance of such modifications for the human system, the modified
algorithm may serve as a useful step towards an automated artificial stereo system.

In this paper, we will briefly review the original Marr-Poggio model and outline the previously
reported implementation and testing of that algorithm. We will then describe some of the open
questions concerning that implementation, as well as some of the modifications suggested by other
models [e.g. Mayhew and Frisby, 1981]. A revised algorithm will then be presented. Finally, we
will illustrate the performance of the modified algorithm by applying it to a series of natural
images. Many of the examples presented are aerial stereo photographs, in part because automated
cartography is one of the traditional areas of application of computer stereo algorithms. We also
consider an example of a robotics application, and investigate the accuracy of the algorithm in
reconstructing the distance to objects in the scene, given measurements for the parameters of the
imaging geometry.

2.  The  Marr-Poggio  Stereo  Model 

In this section, we present a brief review of the original Marr-Poggio model [Marr and
Poggio, 1979], its original implementation [Grimson, 1981a, b] and suggested modifications based
on psychophysical and computational studies [e.g. Mayhew and Frisby, 1981]. Readers interested
in more comprehensive treatments are directed to the original articles.

2.1.  The  Model 

The  algorithm  proposed  by  Marr  and  Poggio  for  solving  the  stereo  correspondence  problem 
can  be  described  as  a  feature-point  based  matching  system,  using  a  coarse  to  fine  control  strategy 
to  limit  the  search  space  of  possible  matches.  As  originally  proposed  [Marr  and  Poggio,  1979],  the 
algorithm  consisted  of  the  following  steps. 

(1) The left and right images are each filtered with oriented second differential operators
of four sizes that increase in size with eccentricity (distance from the center of the eye).
The cross-section of these operators is approximately the difference of two Gaussian
functions with space constants in the ratio 1:1.75. The purpose of this filtering is to
allow the detection of significant intensity changes at multiple scales.

(2) Zero-crossings in the filtered images are located by scanning along lines lying
perpendicular to the orientation of the original differential operator. These zero-crossings
mark the locations of significant changes in the original intensity function, at different
scales. Positions of the ends of lines and edges are also located.

(3) For each operator size and orientation, matching takes place between zero-crossing
segments or terminations of the same contrast sign in the two images, for a range of
disparities up to about the width of the operator's central region. Within this disparity
range, Marr and Poggio showed that false targets pose only a simple problem, because
of the roughly bandpass nature of the filters.

(4) Disparity information obtained by matching features derived from the larger operators
can control vergence eye movements, thus allowing features from the smaller operators
to come into correspondence. In this way, the matching process gradually moves from
dealing with large disparities at a low resolution to dealing with small disparities at a
high resolution [see also, for example, Moravec, 1980].

(5) When a correspondence is achieved, it is stored in a dynamic buffer, called the
2½-dimensional sketch [Marr, 1978].

2.2.  The  Original  Implementation


The first computer implementation of this model was reported in [Grimson, 1981a] (recently
an independent implementation of the algorithm has been reported in [Kak, 1983]). The original
implementation essentially followed the five steps outlined above, although there were a number of
differences. Most of these changes arose from observations made during the process of transferring
the model described above to a working algorithm, since the process of explicitly detailing the
algorithm illuminated some previously unforeseen difficulties, whose solutions led to modifications
to the original model.

The  steps  in  the  implementation  can  be  briefly  outlined  as  follows. 

(1) Image Filtering: The left and right images of a stereo pair are convolved with a series of
two-dimensional operators, whose shape is given by the Laplacian of a Gaussian:

$$\nabla^2 G(x, y) = \left( \frac{x^2 + y^2}{\sigma^2} - 2 \right) \exp\left( -\frac{x^2 + y^2}{2\sigma^2} \right)$$

or by an approximation to this operator, using a difference of two Gaussian functions [Marr and
Hildreth, 1980]. These operators are isotropic with respect to orientation, and hence differ from
the directional operators proposed in the model. (A discussion of this point may be found in
[Grimson, 1981a, b].) The size of the operator, as well as its spatial frequency characteristics, is
determined by the value of the constant $\sigma$, which is related to the width of the central negative
portion of the operator, w, by the following expression:

$$\sigma = \frac{w}{2\sqrt{2}}$$

Figure  1  illustrates  the  form  of  these  operators. 

If each picture element (pixel) is considered equivalent to one photoreceptor in the fovea of
the human visual system, then we may use psychophysical data obtained from measurements on
the human system [e.g. Wilson and Bergen, 1979] to determine the appropriate sizes of operators.
Figure 1. The initial filters. Each image is convolved with a two-dimensional operator whose form is
described by a Laplacian of a Gaussian. The size of the operator is determined by the space constant of
the Gaussian distribution. Part a shows a perspective plot of a ∇²G filter, part b shows a one-dimensional
slice through the center of the filter.

This led us to implement ∇²G operators with widths of w = 9, 18, 30 and 72 picture elements
(pixels) each. It has also been argued on computational grounds [Marr, Poggio and Hildreth, 1979]
and on vernier acuity grounds [Crick, et al., 1980] that an additional smaller operator corresponding
roughly to a width of w = 4 may also be present in the human system. The coefficients of the
operators were represented to a precision of 1 part in 2048. Coefficients of magnitude less than
1/2048th of the maximum value of the operator were set to zero. Thus, the truncation radius of the
operator (the point at which all further operator values were treated as zero) was approximately
1.8w.
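As an illustration of the filtering step, the following is a minimal sketch of how such an operator might be sampled on a pixel grid. It is not the original implementation: it assumes the unnormalized form (r²/σ² − 2) exp(−r²/2σ²) with σ = w/(2√2), cuts the operator off at a radius of 1.8w, and omits the fixed-point quantization of the coefficients. The function name `log_kernel` and its parameters are hypothetical.

```python
import math


def log_kernel(w, truncation=1.8):
    """Sample a Laplacian-of-Gaussian operator with central width w (pixels).

    Assumed (illustrative) form: (r^2/sigma^2 - 2) * exp(-r^2 / (2 sigma^2)),
    with sigma = w / (2 sqrt(2)); values beyond truncation * w are set to zero.
    """
    sigma = w / (2.0 * math.sqrt(2.0))
    radius = math.ceil(truncation * w)       # half-size of the square grid
    limit2 = (truncation * w) ** 2           # squared truncation radius
    kernel = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            r2 = float(x * x + y * y)
            if r2 > limit2:
                row.append(0.0)              # outside the truncation radius
            else:
                row.append((r2 / sigma ** 2 - 2.0)
                           * math.exp(-r2 / (2.0 * sigma ** 2)))
        kernel.append(row)
    return kernel


# Usage: with w = 8 the operator changes sign at a distance of w/2 = 4 pixels
# from the center, so the central negative region is 8 pixels wide.
kernel = log_kernel(8)
center = len(kernel) // 2
```

With this sign convention the center of the operator is negative, which matches the description of w as the width of the central negative portion.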


(2) Symbolic Features: In the original Marr-Poggio theory, the elements to be matched
between images were (i) zero-crossings whose orientations are not horizontal, and (ii) terminations.
It has since been demonstrated [Nishihara and Poggio, 1982] that aspects of human stereo
perception previously believed to imply the need for terminations may be explained strictly on the
basis of zero-crossings. Thus, terminations are not included in the implementation reported here.
It is assumed that the images have been brought into vertical registration, so that the epipolar
lines are horizontal. Thus, zero-crossings in the convolved images are found by scanning along
horizontal lines, seeking pairs of adjacent elements of opposite sign, or triplets of adjacent elements,
the middle of which is zero, the other two containing convolution values of opposite sign. The
positions of the zero-crossings are thus recorded to within an image element. In addition to their
location, two other attributes of the zero-crossings were recorded: (1) contrast sign (whether the
convolution values change from positive to negative, or negative to positive, as we move from left
to right along the scan line) and (2) a rough estimate of the local orientation in the filtered image
of segments of the zero-crossing contour. In the original implementation, the orientation of a point
on a zero-crossing contour was computed as the direction of the gradient of the convolution values
across that segment, and was recorded in increments of 30 degrees.

Figure 2. An example of a stereo pair taken in a laboratory setting.
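The scan-line search for zero-crossings can be sketched as follows. This is an illustrative reading of the description, not the original code: the function name `zero_crossings`, the choice to record a sign-change pair at its left element, and the encoding of contrast sign as +1 (negative to positive) or −1 (positive to negative) are all assumptions. Orientation estimation is omitted.

```python
def zero_crossings(row):
    """Find zero-crossings along one horizontal scan line of convolution values.

    Returns a list of (position, contrast_sign) pairs, where contrast_sign is
    +1 for a negative-to-positive change (left to right) and -1 otherwise.
    """
    found = []
    n = len(row)
    for i in range(n - 1):
        a, b = row[i], row[i + 1]
        if a * b < 0:
            # Pair of adjacent elements of opposite sign.
            found.append((i, 1 if b > 0 else -1))
        elif b == 0 and i + 2 < n and a * row[i + 2] < 0:
            # Triplet whose middle element is zero and whose outer
            # elements have opposite signs.
            found.append((i + 1, 1 if row[i + 2] > 0 else -1))
    return found


# Usage: one sign-change pair at index 1 and one zero-centered triplet at index 4.
print(zero_crossings([3, 1, -2, -1, 0, 4, 2, 0, 0, 1]))
```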

Examples of the convolutions and zero-crossings for a series of operators are illustrated in
Figures 2, 3, and 4.

We note that while the positions of the zero-crossings are specified to within a pixel, it
may be possible to perform subpixel localization. Hildreth [1980] (see also [Crick, et al., 1980])
has demonstrated that in the case of an isolated zero-crossing, a simple linear interpolation
between convolution values serves to localize the zero-crossing to subpixel precision [see also
MacVicar-Whelan and Binford, 1981]. It has been observed in computational experiments that
strong isolated zero-crossings, such as those corresponding to occluding boundaries or shadows,
for example, can be reliably matched to subpixel precision. In the presence of texture or other
confounding photometric effects, however, the accuracy of the subpixel localization decreases,
and is probably not effective. This raises an interesting question about human stereo acuity. It
suggests that for stimuli with isolated zero-crossings (for example, line drawings), stereo acuity
could lie within the subpixel range [Howard, 1919; Woodburne, 1934; Berry, 1948; Tyler, 1977],
but for textured stimuli (for example, random dot stereograms), stereo acuity might be expected
to decrease.
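The linear interpolation mentioned above can be written in a few lines. The sketch below assumes two adjacent samples of the convolution, at pixel positions x0 and x0 + 1, with values of opposite sign; the function name `subpixel_zero` is hypothetical.

```python
def subpixel_zero(x0, v0, v1):
    """Locate a zero-crossing between pixel x0 (value v0) and x0 + 1 (value v1).

    Assumes the convolution varies linearly between the two samples.
    """
    if v0 * v1 >= 0:
        raise ValueError("samples must have opposite signs")
    return x0 + v0 / (v0 - v1)


# Usage: values -1.0 at pixel 10 and 3.0 at pixel 11 place the zero a quarter
# of the way between the two samples.
print(subpixel_zero(10, -1.0, 3.0))  # 10.25
```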

(3) Matching: Given a set of zero-crossing representations at different scales for each of the
images, the matching process proceeded in a coarse to fine iterative manner. The idea [first used by
Moravec, 1977, 1980] is to use a sparse representation of the images, with a coarse spatial sampling,
for the initial matching of points. The reduced density of points greatly reduces the search space
and makes matching easier, at the expense of reduced resolution. This initial match can then be
used to constrain the matching of finer detailed representations, again reducing the search space
of the matching process, while allowing finer detail disparity information to be obtained. Thus,
the matching is guided by a flow of information from coarse representations to finer ones.

Figure 3. Convolutions of the blocks images.

(3.1) Feature Point Matching: Consider first the zero-crossing representations obtained from
the coarsest filters (with central width $w_c$), and suppose that we are given some estimate $d_i$ of the
disparity in a region of the image (which we may initially assume to be some arbitrary value). For
a zero-crossing in one image (say the left) at position (x, y), the search for a matching zero-crossing
in the right image is constrained to the region

$$\{(x', y) \mid x + d_i - w_c \le x' \le x + d_i + w_c\}.$$

(Note that the search takes place along the same horizontal scan line, thereby assuming that the
images have been registered so as to yield horizontal epipolar lines.) This $\pm w_c$ range in the right
image is divided into three pools, two larger convergent and divergent regions, and a smaller one
lying centrally between them. For each pool, matching zero-crossings in the left and right filtered
images must have (1) the same contrast sign, and (2) roughly the same orientation.

Figure 4. Zero-crossings of the blocks images.

A match is assigned on the basis of the responses of the pools. If exactly one zero-crossing
of the appropriate sign and orientation (within 30 degrees) is found within a pool, its location
is transmitted to the matcher. If two candidate zero-crossings are found within one pool (a very
unlikely event (see, for example, Grimson, 1981b)) the matcher is notified and no attempt is made
to assign a match for the point in question. If the matcher finds a single zero-crossing in only one
of the three pools, that match is accepted, and the disparity associated with the match is recorded
in a buffer. If two or three of the pools contain a candidate match, the algorithm records that
information for future disambiguation.

Once all possible unambiguous matches have been identified an attempt is made to
disambiguate double or triple matches. This is done by scanning a neighborhood about the point
in question and recording the sign of the disparity of the unambiguous matches within that
neighborhood. (The sign of the disparity refers to the sign of the pool from which the match
comes: divergent, convergent or zero.) If the ambiguous point has a potential match of the same
sign as the dominant type within the neighborhood, then that is chosen as the match. Otherwise,
the match at that point is left ambiguous.
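The pool-based decision for a single zero-crossing can be sketched as below. This is a simplified illustration, not the original matcher. The function name `match_point`, the half-width `central` of the middle pool, and the assignment of negative offsets to the "divergent" pool and positive offsets to the "convergent" pool are all assumptions made here; the text fixes neither the pool sizes nor the sign convention. The neighborhood-based disambiguation step is not included.

```python
def match_point(x, sign, orient, right_zc, d_est, w, central=1, max_angle=30):
    """Decide the match for a left-image zero-crossing at column x.

    right_zc: list of (column, contrast_sign, orientation_deg) tuples for the
    zero-crossings on the same scan line of the right image.
    Returns ("match", disparity), ("ambiguous", [disparities]) or ("none", None).
    """
    pools = {"divergent": [], "zero": [], "convergent": []}
    for xr, s, o in right_zc:
        offset = xr - (x + d_est)            # position relative to the estimate
        diff = abs(o - orient) % 360
        diff = min(diff, 360 - diff)         # orientation difference with wrap-around
        if abs(offset) > w or s != sign or diff > max_angle:
            continue                         # outside the search range or incompatible
        if offset < -central:
            pools["divergent"].append(xr - x)
        elif offset > central:
            pools["convergent"].append(xr - x)
        else:
            pools["zero"].append(xr - x)
    if any(len(candidates) > 1 for candidates in pools.values()):
        return ("none", None)                # two candidates in one pool: no match attempted
    hits = [candidates[0] for candidates in pools.values() if candidates]
    if len(hits) == 1:
        return ("match", hits[0])            # unambiguous: record the disparity
    if len(hits) > 1:
        return ("ambiguous", hits)           # kept for later disambiguation
    return ("none", None)


# Usage: one compatible candidate close to the estimated disparity.
print(match_point(20, 1, 90, [(25, 1, 90), (40, -1, 90)], d_est=4, w=9))
```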

(3.2) Continuity: It is possible that the region under consideration does not lie within the
disparity range examined by the matcher. This is detected and handled by the following operation.
If the region does lie within the disparity range $\pm w_c$, then excluding the case of occluded points,
every zero-crossing in the region will have at least one candidate match in the other filtered image.
On the other hand, if the region lies beyond the disparity range $\pm w_c$, then the probability of a
given zero-crossing having at least one candidate match will be approximately 0.7 [Marr and Poggio,
1979; Grimson, 1981a, b]. Thus, by counting the percentage of zero-crossings within a region
that have at least one match, and thresholding based on the probabilities stated above, disparities
will be accepted only in regions lying within the current disparity range. This constraint is based
on the continuity assumption [Marr and Poggio, 1979] that surfaces generally vary in a smooth
manner relative to the viewer.
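The regional test described here amounts to comparing a matched fraction against a threshold. A minimal sketch follows; the function name `region_in_range` is hypothetical, and the threshold is left as a parameter because the text only constrains it to lie between the random-match level of about 0.7 and 1.0.

```python
def region_in_range(has_candidate, threshold):
    """Accept a region's disparities only if enough zero-crossings found a match.

    has_candidate: iterable of booleans, one per zero-crossing in the region,
    True when that zero-crossing has at least one candidate match.
    threshold: acceptance level, expected to lie between about 0.7 and 1.0.
    """
    flags = list(has_candidate)
    if not flags:
        return False                     # no evidence in an empty region
    return sum(flags) / len(flags) > threshold


# Usage: 19 of 20 zero-crossings matched passes a 0.9 threshold,
# while 7 of 10 (close to the random level) does not.
print(region_in_range([True] * 19 + [False], 0.9))
print(region_in_range([True] * 7 + [False] * 3, 0.9))
```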

(3.3) Control Strategy: Finally, once this matching has been performed for the coarsest filter,
the sparse disparities obtained can be used to realign the images, and the process can be repeated
at the next finer scale. Since the density of zero-crossings increases as the size of the filter is
decreased, this coarse to fine control strategy allows the matching of very dense zero-crossing
descriptions with greatly reduced false target problems, by using coarser resolution matching to
drive the alignment process.
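The control flow of this coarse to fine strategy can be illustrated with a toy driver. The sketch below is only a schematic: `coarse_to_fine` and `toy_matcher` are hypothetical names, a single scalar disparity stands in for the sparse disparity map, and the toy matcher simply succeeds when the true disparity lies within ±w of the current estimate.

```python
def coarse_to_fine(widths, match_at_scale, d_init=0):
    """Run a matcher from the largest operator width down to the smallest.

    match_at_scale(w, d_est) is a caller-supplied function that searches within
    d_est +/- w and returns a refined disparity, or None if nothing is accepted.
    """
    d_est = d_init
    for w in sorted(widths, reverse=True):
        refined = match_at_scale(w, d_est)
        if refined is not None:
            d_est = refined              # realign before moving to the finer scale
    return d_est


def toy_matcher(true_disparity):
    """Build a stand-in matcher that succeeds only within +/- w of the estimate."""
    def match_at_scale(w, d_est):
        if abs(true_disparity - d_est) <= w:
            return true_disparity
        return None
    return match_at_scale


# Usage: a disparity of 40 pixels is reachable only through the coarsest scale.
print(coarse_to_fine([9, 18, 30, 72], toy_matcher(40)))  # 40
print(coarse_to_fine([9, 18], toy_matcher(40)))          # 0 (never comes into range)
```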

(3.4) Vertical Disparity: While the matching as described above only searches for corresponding
zero-crossing points along the same horizontal scan lines, the control strategy of the algorithm
can easily be modified to handle small amounts of vertical disparity. First, note that due to the
size of the ∇²G filters, the coarser level zero-crossing representations are less sensitive to local
vertical disparity than the finer level ones. Now suppose that the matching has been performed
for the coarsest filter and that the horizontal and vertical disparity in a region of the image is
roughly given by d and r respectively. When proceeding to a finer filter, the search for matching
zero-crossings is initially centered about this disparity. If, however, the density of zero-crossing
points that can be matched at this level is small, it is likely that the horizontal disparity is nearly
correct, but that the vertical alignment is in error. Thus, reapplying the matching process with
the same horizontal alignment but with small variations (on the order of several lines) in the
vertical alignment will lead to a correct alignment of the images and hence to a greater
density of zero-crossings being assigned valid disparity values.

2.3.  Testing  of  the  Original  Implementation

As reported in [Grimson, 1980, 1981a], this implementation of the Marr-Poggio algorithm
has been tested on a variety of images. Much of the original testing was performed on random
dot stereograms, for two reasons. First, because the stereograms are synthetically created, it is
possible quantitatively to compare the disparities computed by the algorithm with the
correct disparities. Second, because random dot stereograms are a standard psychological tool
for examining attributes of the human stereo system, the performance of the algorithm on these
test cases could be compared to human perception, providing a means of examining the adequacy
of the underlying model. Examples of the testing included two-planar stereograms of varying
densities, more complex figures such as a wedding cake and a spiral staircase, stereograms in
which one or both images had been blurred, stereograms with added spatial frequency filtered
noise, stereograms in which one of the images had been decorrelated by different amounts, and
stereograms in which one of the images had been compressed. It was found that on the standard
random dot stereograms, the matching algorithm performed very well, usually with an error rate
of less than one part in a thousand. On noisy or decorrelated stereograms, the error rate was
normally on the order of one percent, while the density of points to which a disparity was assigned
decreased (and in the limit vanished).

The implementation was also tested on a number of natural images, using a variety of
illumination geometries and with objects of differing photometric characteristics. Examples included
a speckled coffee jar, a basketball game, an outdoor metallic sculpture, and a portion of the
Martian surface. For these natural images, a quantitative evaluation was more difficult to obtain,
precisely because the imaging geometry was not controlled, but it was observed that the qualitative
performance of the algorithm was still good.

2.4.  Discussion 

While the initial testing of the algorithm did serve to support the adequacy of the Marr-Poggio
algorithm as a model of aspects of the human stereo system, and while the overall performance of
the matching algorithm was very good, a number of weak points in the algorithm were illuminated
during this testing.

2.4.1.  Continuity  constraints 

It was observed that most of the actual matching errors occurred along discontinuities in
depth, for example at occluding boundaries between two objects. This follows from the use of
matching statistics over a region as a means of distinguishing correct matches from random ones.
Theoretically, this test is based on the observation that surfaces are generally smooth relative to
the observer, and hence disparity will generally also be smooth. While the theoretical observation
is sound, the implementation of it by means of a statistical measure over a region of the image
has some difficulties.
Figure 5. The problem of the continuity constraint near object boundaries.

This is most easily illustrated by the following example. Suppose the region over which the
matching statistics are measured is a square of side d (while this is the easiest to implement, it
is not critical and the following argument holds for other shapes as well). Further suppose that
the stereogram consists of two planar surfaces with a sharp break in disparity between them. Let
the density of zero-crossings be $\rho$ and presume that the region is positioned such that a strip
of width x covers surface A and the remaining strip of width d − x covers surface B (see Figure 5).
Finally, assume that the fixation of the eyes is currently positioned on surface B, so that the portion of
the region covering surface A is out of range of the matching process. If t is the threshold
for accepting the matches in a region as being within the range of the matcher (for the analysis
of Marr and Poggio [1979, p. 317], 0.7 < t < 1.0), then the question to consider is for what values
of x the fraction of matched points in the region will exceed t.

In theory, the number of matched points in the surface B region is expected to be $\rho d(d - x)$,
and the number of matched points in the surface A region is expected to be $0.7\rho x d$. Thus, the
fraction of matched points is given by

$$\frac{\rho d(d - x) + 0.7\rho x d}{\rho d(d - x) + \rho x d} = 1 - 0.3\,\frac{x}{d}.$$

The values of x for which this fraction exceeds t are given by

$$x < \frac{(1 - t)\,d}{0.3}.$$
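The bound can be checked numerically. The sketch below assumes, as in the example, a square statistics region of side d, a strip of width x lying over the out-of-range surface, and a random-match probability of 0.7 there; the function names and the sample values d = 20 and t = 0.85 are illustrative choices, not values taken from the testing.

```python
def matched_fraction(d, x, p_random=0.7):
    """Expected fraction of matched zero-crossings in a d-by-d region.

    A strip of width x lies over the out-of-range surface, where a zero-crossing
    finds a (random) candidate with probability p_random; the density cancels out.
    """
    return (d * (d - x) + p_random * x * d) / (d * d)


def max_overlap(d, t, p_random=0.7):
    """Strip width x below which the region still passes threshold t."""
    return (1.0 - t) * d / (1.0 - p_random)


# Usage: for a 20-pixel region and t = 0.85, the region is accepted until
# half of it (10 pixels) covers the out-of-range surface.
print(max_overlap(20, 0.85))
print(matched_fraction(20, 10))
```

Setting t = 1 gives a bound of zero, so only regions lying entirely on the in-range surface are accepted.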

The most conservative threshold would be t = 1, in which case x = 0 and the only position
of the region for which the disparity values are accepted as correct is that in which the region is
entirely positioned over surface B. While this would work on perfect data, in practice it is likely
to be overly conservative, causing a large reduction in the percentage of zero-crossings to which
a disparity is assigned, although the error rate should be virtually zero. One difficulty with real
data is that even for regions of the image whose disparities are completely within range of the
matcher, the zero-crossing points may not all have matches. For example, geometric distortion in
the sensors, perspective distortions in the imaging geometry, noise in the irradiance values and local
photometric effects all can cause slight variations in the zero-crossings that may result in a small
number of unmatched points. Rather than discard all the disparity information in a region because
a single zero-crossing point does not have an assigned match, we would like to preserve such
information by using a less conservative threshold. Consider, however, the compromise case of
t = 0.85. In this case the limits on the positioning of the region are given by 0 < x < 0.5d,
and in this case, any (incorrect) disparity values lying within 0.5d pixels of the edge of surface B
will be accepted as correct. This is observed in examples of the testing of the algorithm, and while
the number of such errors is small, it is unavoidable within the context of this type of statistical
check. This problem will be very apparent in the case of thin elongated surfaces suspended above
a background, where the widths of the surfaces are less than the diameter of the statistics region,
for example, in an aerial stereo image of a highway interchange.

One means of overcoming this problem is to observe that while it is difficult to ensure
that a region of the image corresponds strictly to a single surface, edges (or zero-crossings) in a
filtered image will generally correspond to a single surface, since they usually reflect changes in the
surface topography or the surface photometry. Thus, rather than imposing a condition of disparity
continuity over an area of the image, one could instead require a continuity of disparity along a
contour in the filtered image. This is essentially the figural continuity constraint of Mayhew and
Frisby [1981], and has been suggested in a slightly different form in Arnold and Binford [1980].
Thus, we need to derive a contour based analog to the regional continuity check used in the
original Marr-Poggio implementation.

Once the feature points have been matched, it can be observed that the collection of all
matched points is composed of two distinct sets. In regions of the image where the zero-crossing
representations  lie  within  matching  range  of  the  current  image  alignment,  the  matched  feature 
points  tend  to  form  extended  contours.  Elsewhere,  the  matched  feature  points  tend  to  lie  in 
scattered  small  segments.  The  goal  of  the  figural  continuity  constraint  is  to  distinguish  between 
these  two  situations. 

We now derive an explicit form for the constraint. We know, by applying Rice's theorem
[Grimson, 1981b, p. 78], that the expected distance between zero-crossings of the DOG filter of
the same contrast sign is given by

\[ e = \frac{5.29\,w}{2\sqrt{2}}. \]

Then given uncorrelated left and right zero-crossing descriptions, the probability of no match at
a particular disparity is

\[ 1 - \frac{1}{e}, \]

and if $p$ denotes the horizontal width of a matching pool, and $v$ denotes its vertical extent, the
probability of no match within a pool of dimensions $p \times v$ is

\[ \left( 1 - \frac{1}{e} \right)^{pv}, \]

and hence the probability of a match in this pool is

\[ \mu = 1 - \left( 1 - \frac{1}{e} \right)^{pv}. \]


Now we consider the probability of randomly matching segments of a contour. Given a contour
segment of length $\lambda$ in one image, we want to determine the probability that $m$ of those
$\lambda$ points have a match within the corresponding pool in the other image, when the two images
are uncorrelated. Clearly, writing $\mu$ for the probability of a match within a single pool, this is
given by

\[ P_{\lambda,m} = \binom{\lambda}{m}\,\mu^{m}\,(1-\mu)^{\lambda-m}. \qquad (1) \]

Thus, given some threshold, $\epsilon$, on the expected error rate, such that $0 < \epsilon < 1$, we
can determine constraints on the length of a matched zero-crossing contour that will be accepted as
corresponding to a correct match. That is, given a threshold $\epsilon$, and a value for the number of
unmatched gaps in the contour, $k = \lambda - m$, we can find the minimum length $\lambda$ of a contour
such that $P_{\lambda,m} < \epsilon$. In particular, we let

\[ \ell_k = \min \{ \lambda \mid P_{\lambda,\lambda-k} < \epsilon \} \]

denote the threshold on the length of matched contour required to satisfy the figural continuity
constraint, for some number of gaps $k$. Note that this is a function of the expected error threshold
$\epsilon$, as well as the horizontal pool size $p$, the vertical pool size $v$, and the mask size $w$.

Thus we have derived a specific form for the figural continuity constraint, namely that the
length of contour that must be matched, as a function of the error threshold, as well as the
parameters listed above, is given by the values of $\ell_k$.
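
As a rough numerical illustration of this constraint, the following sketch computes contour length thresholds under the assumption that pool positions match independently, so that the number of randomly matched points on a contour is binomial. The function names, the search starting at one matched point, and the cap `max_length` are illustrative choices and are not taken from the original implementation.

```python
import math


def expected_spacing(w):
    """Expected distance between same-sign zero-crossings for mask size w."""
    return 5.29 * w / (2.0 * math.sqrt(2.0))


def pool_match_probability(w, p, v):
    """Chance that a p-by-v pool of an uncorrelated image holds a match.

    Assumes each pool position independently holds a suitable zero-crossing
    with probability 1/e, where e is the expected spacing.
    """
    e = expected_spacing(w)
    return 1.0 - (1.0 - 1.0 / e) ** (p * v)


def random_match_probability(length, matched, mu):
    """Binomial probability that exactly `matched` of `length` points match."""
    return math.comb(length, matched) * mu ** matched * (1.0 - mu) ** (length - matched)


def contour_length_threshold(mu, gaps, eps, max_length=10_000):
    """Smallest contour length whose chance of a random match is below eps.

    The search starts at gaps + 1 so that at least one point is matched
    (an assumption of this sketch).
    """
    for length in range(gaps + 1, max_length + 1):
        if random_match_probability(length, length - gaps, mu) < eps:
            return length
    raise ValueError("no contour length up to max_length meets the threshold")
```

With a pool match probability of 0.5 and an error threshold of 0.05, the sketch gives thresholds of 5, 8 and 10 points for contours with 0, 1 and 2 gaps, so each tolerated gap is paid for with a longer required contour.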

2.4.2.  Vertical  disparity 

One of the implicit assumptions of the Marr-Poggio algorithm is that the geometry of the
two sensors yields horizontal epipolar lines. While it is possible to rectify the images to remove
gross geometric distortions caused by factors such as cyclotorsion and camera tilt, there are likely
to be local distortions of the epipolar geometry due to geometric distortions in the sensor, or
perspective effects. Furthermore, the discrete nature of the zero-crossing representation may cause
small variations (on the order of a pixel) in the positions of the zero-crossings. These factors
suggest that although large scale effects on the epipolar geometry can be handled by some type
of image rectification, there may still be small scale variations on the epipolar geometry that must
be handled by the matching algorithm.

In light of this discussion, it is interesting to note recent evidence concerning the effect of
vertical disparities on the human stereo system. It has been observed psychophysically [Duwaer
and van den Brink, 1981a, 1981b] that while up to a degree of vertical disparity can be tolerated
by the human stereo system, almost all of this is handled by invoking an eye movement to align
the images. In the absence of eye movements [Nielsen and Poggio, 1983], only about 2-4 minutes
of vertical disparity can be tolerated. One interpretation of these results is that the stereo matching
mechanism is capable of performing the correspondence process only if the images have been
nearly rectified, and that grosser distortions of the epipolar geometry are corrected for by changing
the alignment of the eyes.

Interestingly, the original implementation of the Marr-Poggio algorithm essentially incorporated
this effect in the following manner. Initially, the vertical disparity was assumed to be zero (although
if monocular cues were incorporated into the system, it would be possible to precompute a
less arbitrary vertical alignment of the images [Marr and Poggio, 1980]), and the matching was
performed at the coarsest resolution. Because of the large size of the filter, the effects of vertical
disparity in the images are less likely to affect the performance of the matcher. Suppose we consider
some region of the image, and use the disparity information computed by the coarse filter to
align the images. If the finer filtered images cannot be matched (or can be only very sparsely
matched), this can be taken as an indication that the images have been correctly aligned to remove
any horizontal disparity, but that a small amount of vertical disparity may be present. Thus, by
applying small alignment corrections in the vertical direction, the images can be brought into
alignment, thereby increasing the density of computed disparity values. This behavior was observed
in computational experiments on a number of natural images.

Although the performance of the Marr-Poggio-Grimson implementation was qualitatively
consistent with the psychophysical data, the use of a stringent epipolar matching geometry was
probably too strict. In other words, while it is feasible to use gross alignments of the images to
account for large scale geometric effects, a strict epipolar matching strategy may be too sensitive
to small local distortions in the zero-crossing descriptions, either due to geometric or perspective
effects, due to noise in the early processing, or due to discretization effects. As a consequence,
it is suggested that the matching of zero-crossings be relaxed slightly. (Note that in the original
Marr-Poggio algorithm, the use of oriented filters suggests that vertical disparity effects would be
more tolerable.) For example, suppose there is a zero-crossing at some point $(x, y)$ in the left
image. The initial Marr-Poggio implementation would search for a corresponding zero-crossing in
the region

\[ \{ (x', y) \mid x + d - w \le x' \le x + d + w \} \]

in the right image. Instead, we propose to search for a corresponding zero-crossing in the region

\[ \{ (x', y') \mid x + d - w \le x' \le x + d + w,\; y - \epsilon \le y' \le y + \epsilon \} \]

where $\epsilon$ is on the order of 1 or 2 scan lines. Note that while this will make the matcher less
sensitive to small distortions or noise, it will also reduce the accuracy of the matching process,
since a single zero-crossing point in one image could potentially be matched to all the points on
a zero-crossing segment lying within this window in the second image, yielding a small range of
disparity values, rather than a single one. The effect will become more noticeable as the orientation
of the zero-crossing segment approaches horizontal.
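
The loss of accuracy can be seen in a small sketch. It assumes that zero-crossings of the right image are stored as a set of (x, y) positions, that the window is centred on x + d with half-width w, and that the disparity implied by a candidate is x' - x; the names `match_candidates` and `disparity_range` are hypothetical.

```python
def match_candidates(right_zc, x, y, d, w, eps):
    """List zero-crossings of the right image inside the relaxed window.

    right_zc is a set of (x, y) positions; the window spans
    x + d - w <= x' <= x + d + w and y - eps <= y' <= y + eps.
    """
    return sorted(
        (xr, yr)
        for (xr, yr) in right_zc
        if x + d - w <= xr <= x + d + w and y - eps <= yr <= y + eps
    )


def disparity_range(candidates, x):
    """Smallest and largest disparity implied by a list of candidates."""
    disparities = [xr - x for (xr, _) in candidates]
    return min(disparities), max(disparities)
```

For a shallow segment through (12, 9), (14, 10) and (16, 11) and a left zero-crossing at (10, 10) with d = 4 and w = 3, a strict search (eps = 0) finds only (14, 10), while eps = 1 admits all three points and the implied disparity spreads from 2 to 6.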

We also note, while discussing vertical disparity, that several authors have recently proposed
using measured vertical disparities to obtain the additional camera parameters needed to convert
disparity directly into distance (Mayhew, 1982; Longuet-Higgins, 1982; Mayhew and Longuet-Higgins,
1982; Prazdny, 1982, 1983). While the algorithm described here does not use the vertical
disparity information in this manner, it is possible to augment the algorithm to do so.

2.4.3. Control strategies and search spaces

Finding the correspondence between points in the two images can be considered as a problem
of searching a space of possible correspondences for the correct solution. In considering this type
of formulation, two separate issues must be considered.

1. Restricting the set of possible alternatives. The key point is to improve the reliability of
the computation, by attempting to ensure no false positives, and as few false negatives
as possible, i.e. no incorrect matches, and as few cases of no answer as possible.

2. Strategies for efficiently searching the space of alternatives to find the correct one.

We wish to separate these two issues, since while they are related, techniques used to reduce
the space of possible correspondences need not be inextricably tied to particular strategies for
searching for those correspondences.

First, we consider means for reducing the space of alternatives that must be explored in
order to find the correct correspondence. Assume that each image is $n \times n$. Then initially each
point in one image has $n^2$ possible matches. As well, there are $n^2$ points in each image, so a
straightforward, British Museum style, search algorithm requires $n^4$ total comparisons. How can
we reduce this?


Feature point systems, while suffering a reduction in the density of computed depth values,
can significantly reduce the space of possible correspondences, by attempting to restrict the
computation to "distinguishable" points in the images. If the density of feature points is $\rho$, then
the set of possible matches becomes $\rho n^2$ and the number of total comparisons under the British
Museum algorithm is $\rho^2 n^4$. Note that in the case of the Marr-Poggio algorithm, $\rho$ varies with the
size of the initial filter. In particular, the expected density of zero-crossings is

\[ \frac{1}{c\,w} \]

where

\[ c = \frac{5.29}{2\sqrt{2}} \]

by the analysis of [Grimson, 1981, p. 78]. Thus, the number of possible candidates for a
correspondence reduces to

\[ \frac{n^2}{c\,w} \]

and the total number of comparisons involved in the search is

\[ \frac{n^4}{c^2 w^2}. \]

The next major constraint that can be applied to the matching process is the epipolar one.
If we take a liberal interpretation of this constraint, then a point on line $y$ can be matched only
to points on lines $y'$ such that $y - v \le y' \le y + v$, for some constant $v$. In this case, each point
has a space of possible matches on the order of

\[ \frac{(2v+1)\,n}{c\,w} \]

and the total number of comparisons over the whole image is

\[ \frac{(2v+1)\,n^3}{c^2 w^2}. \]

The final matching constraint used in the Marr-Poggio algorithm is that of continuity, which
is intended to reduce the number of possible matching candidates from order $n$ to 1. Of course,
one can clearly construct situations in which the number of matching candidates is not reduced to
a unique solution, but in general, as the discussion in the previous section indicated, the continuity
constraint can be structured so as to reduce the probability of false matches to virtually zero.

Note that all of the constraints introduced in this discussion have been matching constraints,
that is, they have reduced the number of possible matches for a given point. As a consequence,
the total size of the search space has also been reduced, but it is important to note that all the
discussion to this point has been independent of the particular search strategy to be employed in
finding corresponding matches. This distinction between the use of matching constraints to alter
the space of possible correspondences, in order to ensure the existence of a unique solution, and
the use of efficient techniques for searching the space of solutions to find the correct solution,
is important in light of the final constraint of the Marr-Poggio algorithm, the use of multiple
resolution representations of the image.


One  use  of  multiple  resolution  representations  is  in  dealing  with  false  targets.  For  example, 
if a fine resolution feature point representation has more than one possible match for a particular
point, the correspondence information at a lower resolution representation can be used to resolve
this ambiguity. This was one of the main uses of multiple resolution representations in the original
Marr-Poggio algorithm. This disambiguation technique was also intertwined with an efficient search
algorithm, however. In particular, the matching of finer level representations is directly
driven from coarser level correspondences (whenever possible). Not only does this provide one
means  of  avoiding  false  targets,  but  it  is  also  an  extremely  efficient  method  for  searching  the  space 
of  possible  matches,  as  is  indicated  in  the  following  discussion. 


Let $w_0$ denote the size of the smallest image filter, and assume that we have $k + 1$ such
filters, each one doubling in size from the previous one. Then, by the discussion above, we know
that at the coarsest level, we must search on the order of

\[ \frac{n^2}{c\,2^k w_0} \cdot \frac{(2v+1)\,n}{c\,2^k w_0} = \frac{(2v+1)\,n^3}{c^2\,2^{2k} w_0^2} \]

alternatives in order to find correspondences for all the feature points in this level of representation.
If the matching process is driven in a coarse-to-fine manner, then at each subsequent level, the
image representations are aligned based on previous matching, and for each feature point, we
need only search an area of size $cw$ to find the correct match. Thus, in principle, we need only
compare

\[ \frac{(2v+1)\,c\,w}{c\,w} = (2v+1) \]

points. This implies that at each of the subsequent levels, we must search $2v + 1$ comparisons for
each of

\[ \frac{n^2}{2^i\,c\,w_0} \]

feature points. Thus, the total number of comparisons needed is on the order of

\[ \frac{(2v+1)\,n^3}{c^2\,2^{2k} w_0^2} + \sum_{i=0}^{k-1} \frac{n^2\,(2v+1)}{2^i\,c\,w_0} \]

points, or equivalently,

\[ \frac{(2v+1)\,n^2}{c\,w_0} \left[ 2 \left( 1 - \frac{1}{2^k} \right) + \frac{n}{c\,w_0\,2^{2k}} \right] \]

points. This is still $O(n^3)$ but as $k$ increases, we see that the amount of search involved in finding
feature point correspondences reduces to the order of the dimensions of the image, i.e. $n^2$. Thus,
one of the advantages of multiple level representations, besides its use in disambiguation of false
targets, is its efficiency in finding the correspondences, especially in situations, such as the human
visual system, in which high resolution information is only required over small portions of the
image at any one time. (Compare this estimate of $O(n^2)$ pointwise comparisons with the results
of [Ohta and Kanade 83] of $O(n^5)$ primitive computations for a general 3-D search algorithm and
$O(n^3)$ primitive computations under certain limiting assumptions.)
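
The saving can be checked numerically. The sketch below assumes the spacing constant c = 5.29/(2√2), k + 1 filters of sizes 2^i w0, and a vertical tolerance of v scan lines; it sums the comparisons level by level and compares the result with the closed form and with a single fine filter matched without coarse guidance. All function names are illustrative.

```python
import math

C = 5.29 / (2.0 * math.sqrt(2.0))  # spacing constant, so C * w is the expected spacing


def coarse_to_fine_cost(n, w0, k, v):
    """Sum the comparisons level by level for k + 1 filters of size 2**i * w0."""
    coarsest = (2 * v + 1) * n ** 3 / (C * 2 ** k * w0) ** 2
    finer = sum((2 * v + 1) * n ** 2 / (2 ** i * C * w0) for i in range(k))
    return coarsest + finer


def closed_form_cost(n, w0, k, v):
    """The same total, written with the geometric series summed."""
    lead = (2 * v + 1) * n ** 2 / (C * w0)
    return lead * (2.0 * (1.0 - 2.0 ** -k) + n / (C * w0 * 2 ** (2 * k)))


def single_level_cost(n, w, v):
    """Comparisons when one filter of size w is matched along epipolar bands."""
    return (2 * v + 1) * n ** 3 / (C * w) ** 2
```

For a 512 x 512 image with w0 = 5, k = 3 and v = 1, the coarse-to-fine total is roughly twenty times smaller than the cost of matching the finest filter alone over the full epipolar band.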

It is curious to note as an aside that one could use the above expression to predict the
number of levels of representation (or equivalently, the number of $\nabla^2 G$ filters) needed to reduce
the search space to $O(n^2)$. If we consider an area spanning 8" on a side with foveal-level receptor
spacing, then a straightforward calculation predicts that B filters are necessary to reduce the search
space to $O(n^2)$. Interestingly, recent investigations by Wilson [1983] provide evidence for 6 such
filters.

If  the  key  consideration  is  not  speed,  but  rather,  high  resolution  depth  information  at  all 
points  in  the  image,  it  is  possible  to  propose  an  alternative  search  strategy,  while  still  taking 
advantage of the disambiguation properties of multiple resolution representations. Rather than
driving  the  matching  process  directly  from  the  coarse  level  information,  we  can  instead  use  that 
information  only  when  needed  for  disambiguation. 

As in the original Marr-Poggio algorithm, for any given alignment of the images (fixation of
the eyes), the search space is restricted to a range on the order of $cw$, so as to avoid the possibility
of false targets. Any candidates that satisfy all the matching constraints are accepted as possible
correspondences, and stored away. If the total range of disparity over the entire image is within
this $cw$ range, then we are done. If not, however, then the same matching process is repeated at
some desired spacing in depth, and the algorithm is swept across the entire range of disparity.
While  for  each  given  alignment  of  the  images,  only  one  match  is  possible,  it  may  be  the  case  that 
matches  for  the  same  feature  points  will  be  found  at  very  different  alignment  positions.  If  this 
is  the  case,  then  this  false  targets  problem  can  be  disambiguated  by  choosing  the  alternative  that 
best  agrees  with  the  correspondence  information  obtained  at  coarser  levels.  Clearly,  such  a  search 
algorithm  requires  a  sweeping  of  fixation  across  the  entire  range  of  depths,  and  while  it  will  result 
in high resolution depth information everywhere in the image, it does so at the expense of speed.

3. A Modified Marr-Poggio Stereo Matcher


We have incorporated all of these considerations into a new algorithm, which we describe
below. While the modifications were made in part because of recent psychophysical evidence
concerning the human stereo system, we will discuss its possible merits as a stereo system for such
applications as automatic aerial cartography and robotics in the next section.

3.1. The Modified Algorithm

We will first outline the basic algorithm, and then provide more detailed descriptions of
each of the steps. The basic steps of the matching algorithm can be summarized in the following
manner. Note that steps 0-3 are identical to the original algorithm. The main concentration on
modifying the algorithm has been at the matching stage. Also note that steps 4.1-4.3 are an instance
of Marr's principle of least commitment [Marr, 1982].

3.1.1.  Outline  of  the  Algorithm 

(0) Loop over levels: We initially choose the coarsest level of representation, i.e. the one
corresponding to the largest image filter, and iterate by choosing successively finer levels of
representation.

(1) Convolution: Given a level of representation, the left and right images are convolved
with the $\nabla^2 G$ filters of the corresponding size.

(2) Zero-crossings: Given the convolved images, the nontrivial zero-crossings are located and
marked with their contrast signs. These zero-crossing descriptions form the basic representations
from which correspondences will be sought.

(3) Loop over fixation position: The relative alignments of the two images are chosen. The
simplest method is to initially choose an alignment corresponding to some lower limit on the
disparity of the images, and slowly increment this offset until some upper limit on the disparity
is reached. This increment could be a pixel at a time, or in terms of some larger fraction of the
width of the matching area for a given fixation position.

(4)  Matching: 

(4.1) Feature point matching: Given a pair of zero-crossing representations, from the current
level, and given a fixation position defining the relative alignments of the two images, feature
point matching is applied. For each feature point in one zero-crossing description, this involves
searching an area of the other zero-crossing description for a zero-crossing of the same contrast
sign. This area has a vertical extent about the same horizontal line in the other image that is
limited to a small number of scan lines, and a horizontal extent, of width defined by the size
of the underlying image filter, about the same position in the other image, offset by the current
relative alignment.

(4.2) Figural continuity: Once all the feature points have been matched for the current level
of representation and the current fixation alignment, figural continuity constraints are applied
to prune the incorrect matches. This involves tracing the zero-crossing contours, searching for
contiguous matched segments of those contours whose length exceeds a threshold whose value can
be determined a priori from the properties of the underlying $\nabla^2 G$ filters.

(4.3) Disparity update: Any matched feature point contours which pass the figural
continuity test are then added to the disparity map, recording the relevant disparity for each feature
point in the accepted contour segments.

(5) Loop: Once this computation of disparities within the defined range about the current
image alignment has been completed, the fixation position is updated by looping to step (3).

(6) Disambiguation: When all the fixation positions have been processed, we are left with a
disparity map representation that contains all matched zero-crossing segments, with their associated
disparities. We now check this map for possible double matches. Any such ambiguities are resolved
by checking the disparities within the same region of the representation at the previous level (if
there is one) and accepting only those disparity values at the current level that are consistent
with those values (i.e. lie within a predefined range of the coarser level disparities). If this
disambiguation does not succeed, either because there is no coarser level, because there are no
disparity values within the same image region at the coarser level, because none of the current
level disparities lie within range of the coarser level ones, or because more than one of the current
level disparities are consistent with coarser level disparities, then all the alternatives are discarded.

(7) Loop: Once the final disparity map for the current level has been completed, the process
proceeds to the next finer level of representation, by looping to step (0).

(8) Consistency: When all the levels of disparity information have been computed, one
final test is possible. Each disparity value at the finest level of representation can be tested for
consistency by checking that, within the same region of the previous disparity representation, there
is at least one disparity value that is consistent with the current value.
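
The disambiguation rule of step (6) can be sketched for a single feature point as follows. The sketch assumes that the candidate disparities for the point and the coarser-level disparities from the same region are available as plain lists, and that "consistent" means lying within a fixed tolerance; the name `disambiguate` and the argument layout are hypothetical.

```python
def disambiguate(candidates, coarse_disparities, tolerance):
    """Resolve multiple candidate disparities for one feature point.

    candidates: disparities found for the point at the current level.
    coarse_disparities: disparities in the same region at the coarser level
        (empty when there is no coarser level or no values in the region).
    Returns the surviving disparity, or None when the point is discarded.
    """
    if len(candidates) == 1:
        return candidates[0]          # a single match needs no disambiguation
    consistent = [
        d for d in candidates
        if any(abs(d - c) <= tolerance for c in coarse_disparities)
    ]
    # Keep the match only when exactly one alternative agrees with the
    # coarser level; otherwise all alternatives are discarded.
    return consistent[0] if len(consistent) == 1 else None
```

Each of the failure cases listed in step (6) maps onto an empty or over-full `consistent` list, so all of them are handled by the final test.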

3.1.2.  Detailed  description  of  the  algorithm 

We now turn to a more detailed description of the different stages of the algorithm.

(1) Convolutions: As in the previous implementation, convolve the images $L$, $R$ with $\nabla^2 G(w)$
filters, for different values of $w$. For notational convenience, we let

\[ LC_w(x,y) = \nabla^2 G(w) * L \]
\[ RC_w(x,y) = \nabla^2 G(w) * R \]

denote the left and right convolutions, that is, for different widths $w$, the convolved image forms
a two-dimensional array indexed by $x$ and $y$. Generally, we use only 3 or 4 values of $w$, for
example, $w = 5, 9, 17, 33$ pixels.
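
A minimal sketch of the filtering step is given below. It assumes that $w$ denotes the diameter of the central region of the operator, so that $\sigma = w / (2\sqrt{2})$, and that the operator is truncated at a radius of 1.5 w; the truncation radius, the zero padding at the image border and the function names are choices made for illustration only.

```python
import math


def laplacian_of_gaussian(x, y, w):
    """Value of the Laplacian-of-Gaussian operator at offset (x, y)."""
    sigma = w / (2.0 * math.sqrt(2.0))       # assumes w is the central diameter
    q = (x * x + y * y) / (2.0 * sigma * sigma)
    return -(1.0 - q) * math.exp(-q) / (math.pi * sigma ** 4)


def log_kernel(w, half_size=None):
    """Sample the operator on a square grid; rows are lists of floats."""
    if half_size is None:
        half_size = int(math.ceil(1.5 * w))  # hypothetical truncation radius
    span = range(-half_size, half_size + 1)
    return [[laplacian_of_gaussian(x, y, w) for x in span] for y in span]


def convolve_at(image, kernel, x, y):
    """Filter response at pixel (x, y), treating out-of-range pixels as 0."""
    h = len(kernel) // 2
    total = 0.0
    for j, row in enumerate(kernel):
        for i, k in enumerate(row):
            yy, xx = y + j - h, x + i - h
            if 0 <= yy < len(image) and 0 <= xx < len(image[0]):
                total += k * image[yy][xx]
    return total
```

Because the kernel is radially symmetric, correlation and convolution coincide here, and the response changes sign across a step edge, which is what the zero-crossing stage relies on.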

(2) Zero-Crossings: As in the previous implementation, compute the zero-crossings of the
convolved images. We let

\[
\begin{aligned}
LP_w(x,y) &= \text{positive zero-crossings of } LC_w(x,y) \\
LN_w(x,y) &= \text{negative zero-crossings of } LC_w(x,y) \\
LH_w(x,y) &= \text{horizontal zero-crossings of } LC_w(x,y) \\
LZ_w(x,y) &= \text{all zero-crossings of } LC_w(x,y) \\
RP_w(x,y) &= \text{positive zero-crossings of } RC_w(x,y) \\
RN_w(x,y) &= \text{negative zero-crossings of } RC_w(x,y) \\
RH_w(x,y) &= \text{horizontal zero-crossings of } RC_w(x,y) \\
RZ_w(x,y) &= \text{all zero-crossings of } RC_w(x,y)
\end{aligned}
\]

Each of these is a bit map.
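
One way these bit maps might be built from a convolved image is sketched below. The conventions are assumptions of the sketch rather than definitions from the text: a positive zero-crossing is a change from negative to non-negative when scanning a row from left to right, a negative one is the reverse, and a horizontal zero-crossing is a sign change between a pixel and the one below it where no change occurs along the row.

```python
def zero_crossing_maps(conv):
    """Return bit maps (P, N, H, Z) for a convolved image given as rows."""
    rows, cols = len(conv), len(conv[0])
    P = [[0] * cols for _ in range(rows)]
    N = [[0] * cols for _ in range(rows)]
    H = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            here = conv[y][x]
            right = conv[y][x + 1] if x + 1 < cols else here
            below = conv[y + 1][x] if y + 1 < rows else here
            if here < 0 <= right:
                P[y][x] = 1               # negative to non-negative along the row
            elif here >= 0 > right:
                N[y][x] = 1               # non-negative to negative along the row
            elif (here < 0) != (below < 0):
                H[y][x] = 1               # sign change only in the vertical direction
    Z = [[P[y][x] | N[y][x] | H[y][x] for x in range(cols)] for y in range(rows)]
    return P, N, H, Z
```

The test for "nontrivial" zero-crossings mentioned in the outline, for instance a minimum slope across the crossing, is omitted here.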


(3) Fixation position: Initially choose the alignment of the two images to correspond to some
preset lower limit, and increment by a specified amount until the alignment exceeds some preset
upper limit.

(4) Matching: The matching algorithm can be subdivided into three sections. First, the
feature points are matched; then, figural continuity is applied to the resulting matches; and finally,
any ambiguities between matches are resolved.


(4.1) Feature point matching. The feature point matching portion of the algorithm can be
summarized as follows. Suppose we are dealing with zero-crossing descriptions corresponding to
some particular filter of size $w_0$. Given a disparity $d_0$, we construct an $N \times N \times 2w_0$ local disparity
array $M$:

\[
M(x,y,r) = \Bigl[ LP_{w_0}(x,y) \wedge \bigvee_{v=y-\epsilon}^{y+\epsilon} RP_{w_0}(x + d_0 + r, v) \Bigr]
\vee \Bigl[ LN_{w_0}(x,y) \wedge \bigvee_{v=y-\epsilon}^{y+\epsilon} RN_{w_0}(x + d_0 + r, v) \Bigr]
\]

where $0 \le x < N$, $0 \le y < N$, and $-w_0 \le r \le w_0$. Thus, each slice $M(x,y,r_0)$ given by a
value $r_0$ of $r$ is a set of matched feature points, within a vertical range of $\pm\epsilon$, for a local disparity
value $r_0$ about the current convergence value $d_0$. Note that positive zero-crossings are matched to
positive ones, and negatives to negatives, over a vertical range of $\pm\epsilon$, and over a horizontal range
of $\pm w_0$ about the current alignment.
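
A direct, unoptimized sketch of this matching step follows. It assumes the four zero-crossing bit maps are lists of rows and that local disparities run over the closed range from -w0 to w0, which gives 2 w0 + 1 slices; positions that fall outside the right image are treated as unmatched. The function name is hypothetical.

```python
def match_feature_points(LP, LN, RP, RN, d0, w0, eps):
    """Build M[y][x][r + w0] for local disparities r in [-w0, w0].

    LP, LN, RP, RN are bit maps (lists of rows) of positive and negative
    zero-crossings in the left and right images.
    """
    rows, cols = len(LP), len(LP[0])

    def any_in_column(bitmap, x, y):
        # OR of the bitmap over rows y - eps .. y + eps at column x.
        if not 0 <= x < cols:
            return 0
        lo, hi = max(0, y - eps), min(rows - 1, y + eps)
        return int(any(bitmap[v][x] for v in range(lo, hi + 1)))

    M = [[[0] * (2 * w0 + 1) for _ in range(cols)] for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            for r in range(-w0, w0 + 1):
                xr = x + d0 + r
                pos = LP[y][x] and any_in_column(RP, xr, y)
                neg = LN[y][x] and any_in_column(RN, xr, y)
                M[y][x][r + w0] = int(bool(pos or neg))
    return M
```

Contrast signs are never mixed: a positive zero-crossing in the left image can only be matched through `RP`, and a negative one only through `RN`.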


(4.2) Figural continuity.

In order to distinguish correct from random feature point matches, we apply a figural
continuity constraint, by restricting the accepted matches to those extended contour segments
whose length is sufficiently large. First, we need a means of defining a path along a zero-crossing
contour. If $LZ_{w_0}(x,y) = 1$, that is if there is a zero-crossing at this point, then we define
$f_{L,w_0}(x,y) = (x', y')$ to be the next point along the zero-crossing contour. In other words, if the vector
$r = (x, y)$ is an index into the zero-crossing array, and if $LZ_{w_0}(x_0,y_0) = LZ_{w_0}(r_0) = 1$ then the
ordered sequence

\[ r_0,\; f_{L,w_0}(r_0),\; f_{L,w_0}(f_{L,w_0}(r_0)),\; \ldots \]

traces out a zero-crossing contour.

Then, given a threshold $\epsilon$ on the expected error rate ($0 < \epsilon < 1$), we need a threshold on
the length of the matched contour segments. By the previous discussion, this is given by

\[ \ell_k = \min \{ \lambda \mid P_{\lambda,\lambda-k} < \epsilon \} \]

where $P_{\lambda,m}$ is given by equation (1). Thus, we let $\ell_0$, $\ell_1$, $\ell_2$ denote the contour lengths required
by contours of 0, 1 and 2 gaps respectively; then the procedure for figural continuity can be
specified as follows.


Figural Continuity Procedure

Compress all the matches into one representation:

\[ MT(x,y) = \bigvee_{r=-w_0}^{w_0} M(x,y,r) \quad \forall x, y. \]

Initialize the output array:

\[ SM(x,y) = 0 \quad \forall x, y. \]

For each point $r_0 = (x_0, y_0)$ such that $MT(r_0) = 1$, apply the following procedure. Set:

    $g = 0$ ; gap counter
    $\ell = 1$ ; length counter
    $S = \{ r_0 \}$ ; contour tested
    $p = r_0$ ; contour pointer.

(0) If $f_{L,w_0}(p) = r_0$
    then we have completed tracing the contour, and it is not long enough, so exit without
    saving the contour;
    else,
        if $LH_{w_0}(f_{L,w_0}(p)) = 1$
        then the next point is a horizontal zero-crossing, so go to (1);
        else,
            if $MT(f_{L,w_0}(p)) = 0$
            then there is a gap so increment the gap counter: $g = g + 1$
            and go to (1);
            else increment the length counter: $\ell = \ell + 1$
            and continue.

(1) If $g > 2$
    then the gap is too large, so exit without storing the contour;
    else,
        if $g = 2$,
        then,
            if $\ell > \ell_2$
            then save the contour: $\forall p \in S$, set $SM(p) = SM(p) \vee MT(p)$
            else go to (2).
        else,
            if $g = 1$,
            then,
                if $\ell > \ell_1$
                then save the contour: $\forall p \in S$, set $SM(p) = SM(p) \vee MT(p)$
                else go to (2).
            else,
                if $g = 0$,
                then,
                    if $\ell > \ell_0$
                    then save the contour: $\forall p \in S$, set $SM(p) = SM(p) \vee MT(p)$
                    else go to (2).

(2) Increment the contour collection, setting $S = S \cup \{ f_{L,w_0}(p) \}$
    and increment the contour pointer, setting $p = f_{L,w_0}(p)$.

    Go to (0).
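
The procedure translates fairly directly into a loop. The sketch below is one reading of it, under the assumptions that the contour is given as a successor map, that the matched and horizontal points are given as sets, and that tracing also stops when the contour ends or revisits any point already tested; the name `trace_contour` and the data layout are hypothetical.

```python
def trace_contour(start, next_point, matched, horizontal, thresholds):
    """Follow a contour from `start`; return the accepted points or set().

    next_point: dict mapping each contour point to its successor.
    matched, horizontal: sets of points flagged in MT and LH.
    thresholds: required lengths indexed by gap count, e.g. [l0, l1, l2].
    """
    gaps, length = 0, 1
    tested = [start]
    p = start
    while True:
        q = next_point.get(p)
        if q is None or q in tested:
            return set()                      # contour ended or closed: too short
        if q not in horizontal:               # horizontal points change no counter
            if q in matched:
                length += 1
            else:
                gaps += 1
        if gaps >= len(thresholds):
            return set()                      # too many gaps
        if length > thresholds[gaps]:
            # Save the points tested so far that carry a match.
            return {pt for pt in tested if pt in matched}
        tested.append(q)
        p = q
```

As in the procedure, the point that pushes the length over the threshold is not itself saved on this pass; it is picked up when the trace is started from a later point on the same contour.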


(4.3) Disparity updating.

When this procedure is finished, $SM(p)$ contains all the matches for this alignment that pass
the figural continuity constraint. Now, we need to update the global disparity array $D_{w_0}(x,y,d)$.
This is accomplished by looping over all values of $p$ and applying the following procedure.


Disparity Update Procedure

If

\[ SM(p) = 1, \]

then set

\[ D_{w_0}\!\left( p,\; \frac{\sum_{r=-w_0}^{w_0} M(p,r)\,(d_0 + r)}{\sum_{r=-w_0}^{w_0} M(p,r)} \right) = 1. \]

That is, we mark a 1 at the point in the three-dimensional disparity array corresponding to
the average disparity of the local matches. Thus for each $d$, the set

\[ \{ D_{w_0}(p,d) \mid \forall p \} \]

is a disparity slice of the matched images.
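
A sketch of the update step is given below. It assumes the disparity array is stored sparsely as a dictionary keyed by (x, y, d), that the match bits for a point are indexed by r + w0, and that the average disparity is rounded to the nearest integer before it is used as an index; the rounding rule and the function name are assumptions of the sketch.

```python
def update_disparity(D, SM, M, d0, w0):
    """Mark in D the average local disparity of each accepted point.

    D maps (x, y, d) to 1; SM[y][x] flags accepted points; M[y][x] lists the
    match bits for r = -w0 .. w0 (index r + w0).
    """
    for y, row in enumerate(SM):
        for x, accepted in enumerate(row):
            if not accepted:
                continue
            hits = [r for r in range(-w0, w0 + 1) if M[y][x][r + w0]]
            if hits:
                mean = sum(d0 + r for r in hits) / len(hits)
                D[(x, y, round(mean))] = 1    # rounding is an assumption
    return D
```

Calling the function once per fixation position accumulates the slices of the disparity array for the current filter size.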

To create the total disparity array $D_{w_0}$, we can simply let $d_0$ range between preset lower and
upper limits and iterate over the previous steps. Note that this is an extremely simple control strategy, which
could  clearly  be  augmented,  for  example  along  the  lines  suggested  in  the  original  Marr-Poggio 
theory.  In  cases  where  a  detailed,  fine  resolution,  disparity  map  is  desired,  this  simple  control 
mechanism  should  suffice.  In  situations  in  which  speed  is  a  critical  factor,  an  attention  focussing 
mechanism  that  uses  coarse  disparity  information  to  guide  finer  resolution  matching  is  probably 
essential. 


The above algorithm has been specified for a single operator size $w_0$ and can be applied at
each of the four sizes specified earlier. The original Marr-Poggio theory proposed that a coarse to
fine matching strategy be used to guide the matching at finer resolution representations, in part
because the ambiguity of such matches increases with the increasing density of the zero-crossings.
While we have split off the control strategy aspects of this proposal by sweeping the images
through the entire range of possible disparities for each operator, the use of multiple resolution
operators as a means of disambiguation still remains a possibility.

(5) Loop: Simply loop back over the preceding steps to increment over all possible image alignments.

(6) Disambiguation. In particular, while only a single match will be assigned to a zero-crossing
point for each alignment of the images, $d_0$, it is possible that more than one contour will be
matched to the point, as the disparity sweeps through its preset range. We can use the
disparity information obtained at coarser channels to help disambiguate this case. For each channel
size $w_0$, we perform the following operations.

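First, we project the disparity array, setting, $\forall p$:

$$PD_{w_0}(p) = \begin{cases} d, & \text{if } D_{w_0}(p,a) = \delta_{a,d},\ \forall a \\ \text{null}, & \text{if } D_{w_0}(p,a) = 0,\ \forall a \\ ?, & \text{otherwise.} \end{cases}$$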

Thus, if there is exactly one match, $PD_{w_0}(p)$ equals the disparity value of that match; if there
is no match, it is set to null; and if there is more than one match, $PD_{w_0}(p)$ is marked with the
special character "?". If $w_0$ is currently set to the largest possible filter size, then nothing can be
done. If it is set to a smaller filter size, however, then let $w_1$ denote the next largest filter size and
proceed in the following manner.
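
The projection step can be sketched as below. The integer sentinels `NULL` and `AMBIG` (standing in for "null" and "?") and the array layout are assumptions of this example, not part of the original description.

```python
import numpy as np

NULL = -10**9       # hypothetical sentinel for "null"
AMBIG = NULL + 1    # hypothetical sentinel for "?"


def project_disparity(D, d_min=0):
    """Collapse a bool array D[y, x, k] (disparity d_min + k) into a 2-D map."""
    counts = D.sum(axis=2)               # number of matches at each point
    first = D.argmax(axis=2) + d_min     # disparity of the first match, if any
    PD = np.full(counts.shape, NULL, dtype=np.int64)
    unique = counts == 1
    PD[unique] = first[unique]           # exactly one match: keep its disparity
    PD[counts > 1] = AMBIG               # several matches: mark as ambiguous
    return PD
```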


Disambiguation Procedure

For each point $p$ such that $PD_{w_0}(p) = {?}$, let

    $A = \{a \mid D_{w_0}(p, a) = 1\}$

denote the set of possible matches for this point.

If there is a point $p'$ in a neighbourhood about this point, such that

    $PD_{w_1}(p') \neq \text{null}$

and

    $PD_{w_1}(p') \neq {?}$

and such that the difference

    $|PD_{w_1}(p') - a_i|$

is smaller than the allowed disparity difference for some $a_i \in A$,

then $a_i$ is a legitimate disparity value.

If there is exactly one legitimate element $a_i$ of $A$,

then set

    $PD_{w_0}(p) = a_i$,

else set

    $PD_{w_0}(p) = \text{null}$.

In this manner we create the disparity map $PD_{w_0}$ for the current filter size $w_0$.

(7) Loop: We can iterate this procedure over decreasing values of $w_0$. When this is finished,
we have a series of disparity maps $PD_w$, of increasing resolution as $w$ decreases.
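
A minimal sketch of the disambiguation of one channel against the next coarser one is given below. It assumes that both disparity maps are sampled on the same pixel grid, that the neighbourhood is a square window of half-width `radius`, and that `tol` is the allowed disparity difference; these parameters and the sentinels `NULL` and `AMBIG` are hypothetical choices for the example.

```python
import numpy as np

NULL = -10**9       # hypothetical sentinel for "null"
AMBIG = NULL + 1    # hypothetical sentinel for "?"


def disambiguate(D, PD_fine, PD_coarse, radius, tol, d_min=0):
    """Resolve '?' entries of PD_fine using the coarser map PD_coarse."""
    out = PD_fine.copy()
    for y, x in zip(*np.nonzero(PD_fine == AMBIG)):
        candidates = np.nonzero(D[y, x])[0] + d_min      # the set A
        window = PD_coarse[max(0, y - radius): y + radius + 1,
                           max(0, x - radius): x + radius + 1]
        support = window[(window != NULL) & (window != AMBIG)]
        # A candidate is legitimate if some coarse neighbour lies close to it.
        legit = [a for a in candidates if np.any(np.abs(support - a) < tol)]
        out[y, x] = legit[0] if len(legit) == 1 else NULL
    return out
```

Iterating from the second-largest filter down to the smallest, and passing each resolved map as `PD_coarse` for the next channel, gives the sequence of maps described in step (7).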

(8) Consistency. The disambiguation process described above can be considered as a type
of consistency check. That is, if there are two contours that, to within the limits of the figural
continuity constraint, match a given contour, we can use coarser level information to eliminate
the incorrect match. This relies on the assumption that the correct contour will be accepted by
figural continuity. There may also be circumstances in which the correct contour is not accepted,
for example because it is occluded in one of the images, but in which an incorrect contour passes
the figural continuity constraint, and is accepted as a correct match. While this occurs very rarely
(empirical observations suggest that less than 0.001 of the matched zero-crossing contours have this
problem), it is possible to apply a consistency check to the computed disparity maps to remove
this possibility.


Consistency Procedure

Given two adjacent filter sizes $w_1 < w_2$, $\forall p$,
if $PD_{w_1}(p) \neq \text{null}$
    then,
        if $A_{w_2}(p)$ is empty, leave $PD_{w_1}(p)$ as it stands,
    else,
        if there is a point $p' \in A_{w_2}(p)$ such that $|PD_{w_1}(p) - PD_{w_2}(p')| < t$
            then leave $PD_{w_1}(p)$ as it stands,
        else, set $PD_{w_1}(p) = \text{null}$, as it is not consistent with the coarser
        resolution disparity map.
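
The consistency check can be sketched in the same style. As before, the square window of half-width `radius`, the tolerance `tol` and the sentinels are assumptions made for the example; the coarse neighbourhood is taken to be the set of non-null coarse disparities inside the window.

```python
import numpy as np

NULL = -10**9       # hypothetical sentinel for "null"
AMBIG = NULL + 1    # hypothetical sentinel for "?"


def consistency_check(PD_fine, PD_coarse, radius, tol):
    """Null out fine disparities that disagree with every coarse neighbour."""
    out = PD_fine.copy()
    valid = (PD_fine != NULL) & (PD_fine != AMBIG)
    for y, x in zip(*np.nonzero(valid)):
        window = PD_coarse[max(0, y - radius): y + radius + 1,
                           max(0, x - radius): x + radius + 1]
        support = window[(window != NULL) & (window != AMBIG)]
        if support.size == 0:
            continue  # empty neighbourhood: leave the value as it stands
        if not np.any(np.abs(support - PD_fine[y, x]) < tol):
            out[y, x] = NULL  # inconsistent with the coarser map
    return out
```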


4.  Examples 

We  will  examine  two  different  types  of  stereo  imagery  in  this  section,  a  laboratory  scene 
with many of the characteristics of industrial robotics situations, and aerial photographs of natural
and  artificial  terrain.  The  intent  is  both  to  provide  a  means  of  examining  the  performance  of 
the  stereo  algorithm  outlined  in  the  previous  section,  and  to  consider  the  potential  applicability 
of  such  algorithms  to  automated  stereo  acquisition  of  depth  information,  both  in  robotics  and 
cartography. 

4.1.  Laboratory  Scenes 

We consider first an example of a laboratory scene, shown in Figure 2. The scene is composed
of a set of wooden blocks, of different shapes and lying at different distances from the cameras.
The images were taken with a Hitachi CCD camera, and are 288 by 224 pixels each. The images
contain grey-levels from 0 to 255, although the contrast range is more on the order of 10 to 110.
The cameras were positioned roughly 1500 mm from the foremost point in the image, namely
the front of the cylinder, with a separation of roughly 290 mm. By roughly, we mean that the
distances were measured to an accuracy of a few millimeters.

Figure 6. The set of matched zero-crossings for the blocks image.

The left and right images were convolved with four different sized $\nabla^2 G$ filters, with central
widths given by $w$ = 17, 13, 9 and 5 pixels each. These convolutions are illustrated in Figure 3.
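
For readers who wish to reproduce filters of this kind, the following sketch builds a sampled $\nabla^2 G$ kernel from its central width. It uses the standard relation $w = 2\sqrt{2}\,\sigma$ between the central width of the two-dimensional operator and the Gaussian space constant; the truncation radius of $4\sigma$ and the mean subtraction are choices of this example and are not taken from the special purpose convolution hardware used here.

```python
import numpy as np


def log_kernel(w: float) -> np.ndarray:
    """Sampled Laplacian-of-Gaussian kernel with central width w (pixels)."""
    sigma = w / (2.0 * np.sqrt(2.0))          # w = 2 * sqrt(2) * sigma
    half = int(np.ceil(4.0 * sigma))          # truncation radius (assumption)
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = (x**2 + y**2) / (2.0 * sigma**2)
    kernel = (u - 1.0) * np.exp(-u)           # proportional to the true operator
    return kernel - kernel.mean()             # force an exactly zero-sum kernel


# The four channel sizes used for the blocks scene.
kernels = {w: log_kernel(w) for w in (17, 13, 9, 5)}
```

The kernel is negative inside the central region of diameter $w$ and positive in the surround, so a uniform image region produces a zero response.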

The  zero-crossings  obtained  from  each  of  these  convolutions  are  shown  in  Figure  4.  Note 
by  comparison  to  the  convolutions  that  most  of  the  zero-crossings  in  the  support  plane  have 
very  shallow  gradients,  corresponding  to  low  contrast  changes  in  the  images.  The  positions  of 
such  zero-crossings  tend  to  be  sensitive  to  noise,  an  issue  to  which  we  will  return  shortly.  As 
has been demonstrated in earlier implementations of the Marr-Poggio model, the density of the
zero-crossings is inversely proportional to the size of the $\nabla^2 G$ filter. Note also that the zero-crossings
of  the  largest  operator  tend  to  capture  coarse  features  of  the  objects,  such  as  their  occluding 
boundaries,  while  the  zero-crossings  of  the  smaller  operators  tend  to  capture  in  addition  finer 
details,  such  as  the  wood  grain  on  the  objects. 
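
A simple way to locate zero-crossings along the rows of a convolved image, together with a measure of how shallow each one is, is sketched below. Treating a sign change between horizontally adjacent pixels as a zero-crossing, ignoring exact zeros, and using the absolute difference across the crossing as the slope are all simplifying assumptions of this example.

```python
import numpy as np


def row_zero_crossings(conv):
    """Return (zc, slope): sign changes along rows and their steepness."""
    conv = np.asarray(conv, dtype=float)
    left, right = conv[:, :-1], conv[:, 1:]
    crossing = left * right < 0               # strict sign change between neighbours
    zc = np.zeros(conv.shape, dtype=bool)
    slope = np.zeros(conv.shape, dtype=float)
    zc[:, :-1] = crossing
    slope[:, :-1] = np.where(crossing, np.abs(right - left), 0.0)
    return zc, slope
```

Zero-crossings with a small `slope` value correspond to the shallow, noise-sensitive contours discussed above.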

The set of zero-crossings from the finest level operator to which a matching zero-crossing is
assigned by the algorithm is displayed in Figure 6. Note that the figural continuity constraint has
removed virtually all of the matches corresponding to the shallow zero-crossings of the background
plane. As we noted earlier, these shallow zero-crossings tend to be sensitive to noise in the system,
and as a consequence there can be a noticeable variation in the position of such zero-crossings,
due to this noise component. One of the advantages of the algorithm presented here is that
the variation in zero-crossing position due to noise will generally violate the figural continuity
constraint, and hence such matches, with inherently noisy disparity information attached to them,
will be pruned from the final disparity data. We should note, however, that there may be other
edge detection techniques that are more effective at removing such noise-sensitive features prior
to the matching stage (for example, Canny, 1983).

Figure 7. Contour map of the blocks image.

The vertical disparity in this set of images covers a range of ±3 lines. To obtain the results
displayed  here,  the  algorithm  was  run  at  three  different  vertical  alignments,  and  the  results  of  each 
pass  of  the  algorithm  were  merged  into  a  single  disparity  array. 
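
The merging of several vertical alignments can be sketched as follows. The function `match` is a hypothetical stand-in for the complete matcher at one vertical alignment, and the particular vertical offsets are left as a parameter, since they are not listed here.

```python
import numpy as np


def shift_rows(img, dy):
    """Shift image content down by dy rows (up if dy < 0), filling with zeros."""
    out = np.zeros_like(img)
    if dy > 0:
        out[dy:] = img[:-dy]
    elif dy < 0:
        out[:dy] = img[-dy:]
    else:
        out[:] = img
    return out


def match_over_vertical_alignments(left_zc, right_zc, match, offsets):
    """Run `match` once per vertical offset and OR the disparity arrays."""
    merged = None
    for dy in offsets:
        D = match(left_zc, shift_rows(right_zc, dy))
        merged = D.copy() if merged is None else (merged | D)
    return merged
```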

Finally,  in  order  to  display  the  results  of  the  stereo  algorithm,  we  apply  the  following  process. 
We  first  interpolate  the  disparity  information  provided  by  the  finest  level  channel,  using  a  model 
of visual surface reconstruction based on the image irradiance equation (Grimson, 1982, 1983a,
1983b). To do this, we use a portion of an efficient multi-grid implementation of an alternative
but similar surface interpolation model, developed by Terzopoulos (1983, 1984). Given the output
of  this  process,  which  is  a  dense  reconstruction  of  the  disparity  over  the  image,  we  plot  isometric 
disparity  contours,  as  shown  in  Figure  7. 

The  isometric  disparity  contours  clearly  demonstrate  the  local  variations  in  depth  of  the 
objects,  as  computed  by  the  stereo  algorithm.  It  can  be  seen  that  the  isometric  disparity  contours 
are  not  perfectly  parallel,  as  might  be  expected  from  the  shape  of  the  blocks.  This  indicates  that 
while overall the computed shape of the objects is correct, there may be a certain amount of local
variation in the disparity values, leading to a distortion of the isometric contours. This is further
illustrated in Figure 8, which shows a perspective view of the reconstructed surfaces of the blocks.

To further evaluate the performance of the algorithm, especially the extent of this local
variation, we performed the following additional tests. First, the disparity information was converted
to actual distance values, based on the separation of the cameras, the angles of convergence of the
cameras and the size of each individual pixel. These parameters were measured for the geometry
used to record the original stereo images, and thus, the distances from the camera to points in
the image were computed. The following table records the computed and measured distances, in
millimeters, for a selected set of points in the image.

Figure 8. Perspective view of the reconstructed blocks surfaces.


Table I - Computation of Distance (all values in millimeters)

Points                     Computed   Measured   Difference
Cylinder front               1506       1517         11
Wedge front                  1647       1665         18
Block front                  1743       1758         15
Cylinder to block             237        241          4
Cylinder to wedge             141        148          7
Cylinder radius - left         16         17          1
Cylinder radius - right        18         17          1
Wedge - depth extent           33         35          2
Block - depth extent           47         50          3

The first three entries record absolute depth measurements, and it can be seen that the
computed distances to the fronts of the three objects are off by approximately 15 mm, out of a
sensing distance of 1500 mm, or roughly 1%. Note that this transformation to absolute distance is
sensitive  not  only  to  errors  in  the  computation  of  stereo  correspondence,  but  also  to  errors  in  the 
measurement  of  the  camera  geometry.  Given  the  coarseness  with  which  the  camera  parameters 
were  computed,  it  is  likely  that  this  is  the  major  source  of  error  in  the  computation  of  absolute 
distance. 
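
The figures quoted for the absolute measurements can be checked directly from the first three rows of the table. The short calculation below uses only those tabulated values and the nominal sensing distance of 1500 mm.

```python
# (computed, measured) distances in millimeters, from the first three table rows.
absolute = {
    "Cylinder front": (1506, 1517),
    "Wedge front": (1647, 1665),
    "Block front": (1743, 1758),
}

differences = [measured - computed for computed, measured in absolute.values()]
mean_error_mm = sum(differences) / len(differences)
percent_of_range = 100.0 * mean_error_mm / 1500.0

print(differences)                  # [11, 18, 15]
print(round(mean_error_mm, 1))      # 14.7
print(round(percent_of_range, 1))   # 1.0
```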

The remaining entries of the table record relative computed distances, both for separations
of the objects, and for the depth extent of the objects. The fourth and fifth entries record the
computed and measured relative separations of the objects. The final four entries record the radius
of the cylinder, as measured to the left and right of the front of the cylinder, and the change
in depth across the block and wedge, for this particular viewing angle. On average, the error in
relative depth tends to be on the order of 5-7 mm, out of a total depth range of 300 mm. To
put this in the context of the stereo algorithm, we note that for this camera geometry, an error
in stereo matching of one pixel would give rise to a depth error of 5-10 mm, depending on the
actual location in the image. Thus, the errors in relative depth are essentially on the order of a
pixel in disparity.

4.2. Aerial Photographs


The second type of images to which we have applied the stereo algorithm are aerial
photographs,  both  of  natural  terrain  and  man-made  structures.  The  performance  of  the  modified 
stereo  algorithm  on  all  the  images  is  summarized  in  the  following  table. 


Table II - Stereo Summary

                    Blocks      UBC         Ft. Sill    Phoenix     Boeing
Size                288 x 224   320 x 320   512 x 512   512 x 512   320 x 320
Disparity Range     56          13          51          31          13
Zero-crossings      11013       16801       32907       31103       10612
Matched Z-Cs        1780        12310       16073       23890       6608
Matching Errors     0           9           286         78          167
After Consistency   0           0           0           0           33

The row labelled size indicates the dimensions of the images. The row labelled disparity range lists
the disparity range of each image pair, in pixels. In the row labelled zero-crossings, we indicate
the total number of zero-crossing pixels, including horizontal ones. In the row labelled matched
z-c's, the number of such zero-crossings that are assigned a match is indicated. In the row labelled
matching errors, the number of zero-crossing pixels that are assigned an incorrect match is listed.
Note that we distinguish here between matching errors and localization errors. Matching errors are
those that arise when incorrect zero-crossing contours are matched, independent of the accuracy
of the contours themselves. Such errors tend to be relatively large in disparity. Localization errors
are those that arise due to error in position of the zero-crossing contour itself. Such errors usually
tend to be relatively small. The row labelled after consistency lists the number of such matching
errors that remain after the consistency constraint is applied between different resolution disparity
maps.
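
To put the matching errors in proportion, the error counts can be expressed as a percentage of the matched zero-crossings. The calculation below uses the matched, matching errors and after consistency rows as read from the table; the variable names are illustrative only.

```python
# image: (matched z-c's, matching errors, errors after consistency)
summary = {
    "Blocks": (1780, 0, 0),
    "UBC": (12310, 9, 0),
    "Ft. Sill": (16073, 286, 0),
    "Phoenix": (23890, 78, 0),
    "Boeing": (6608, 167, 33),
}

rate_before = {name: 100.0 * err / matched
               for name, (matched, err, _) in summary.items()}
rate_after = {name: 100.0 * err / matched
              for name, (matched, _, err) in summary.items()}

for name in summary:
    print(f"{name:9s} before: {rate_before[name]:.2f}%  after: {rate_after[name]:.2f}%")
```

On these figures, matching errors affect at most a few percent of the matched points before the consistency check, and only the Boeing pair retains any afterwards.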

The images themselves are illustrated in Figures 9-20. For each one, we show the stereo images,
the disparity map obtained by matching the zero-crossings at the finest level of representation,
and a contour map based on this disparity map. The disparity maps are displayed using intensity
to encode height, so that the brighter disparity points are closer. To obtain a contour map
representation of the results, we have applied a surface reconstruction algorithm (Grimson, 1982,
1983a; Terzopoulos, 1983, 1984) to the stereo data.

The first pair of images, from the Phoenix area, are illustrated in Figure 12, and were supplied
courtesy of the Defense Mapping Agency. A second stereo pair of natural terrain, from the
Fort Sill, Oklahoma area, are illustrated in Figure 9, and were supplied courtesy of the U.S.
Army Engineering Topographic Laboratory. The next stereo pair, from the University of British
Columbia, and supplied courtesy of UBC, are illustrated in Figure 16. The final stereo pair are of
a highway interchange, and were supplied courtesy of Boeing Corporation.

Figure 11. Contour map (Ft. Sill) based on matching before consistency check.

Figure 12. Natural terrain stereo pair (Phoenix).

A number of comments are in order concerning the performance of the algorithm, as indicated
above. We note that in the case of the blocks scene, the percentage of matched zero-crossings to total
zero-crossings is small, on the order of 17 percent. Note, however, that many of the zero-crossings
are shallow, unstable zero-crossings, corresponding to small fluctuations in the photometric process,
as illustrated by Figure 3. If we consider only zero-crossing points on the blocks themselves, then
the number of eligible zero-crossing points reduces to 2703, of which 1780 are assigned a match.
Note further that this number of 1780 does not include any strictly horizontal zero-crossing points,
nor does it include very small zero-crossing contours, which fall below the matching thresholds,
and are hence unmatchable.

Figure 15. Contour map (Phoenix) based on matching after consistency check.

Figure 16. Natural terrain stereo pair (UBC).

The Fort Sill image does provide some difficulty for the algorithm, particularly because the
photometric properties of the images cause a certain amount of fluctuation in the positions of the
zero-crossing contours. As a consequence of the design of the matching procedure, which favors
no match to possible incorrect matches, a large number of the potential zero-crossing points are
not matched. Note, however, that the percentage of matched zero-crossings to total zero-crossings
is somewhat misleading, since a large number of the total are not, in fact, matchable. In this
case, at least ten percent of the zero-crossings in the left image are not present in the right since
they lie beyond the edge of the image. We also note that the contour map displayed in Figure
11 is based on the results of the matching algorithm before the consistency check is applied. As
a consequence, the effect of the single incorrectly matched contour in the upper left quadrant
is clearly visible as a sudden dip in the contour map. This clearly demonstrates the need for a
consistency check to remove obvious matching errors that survive the matching process itself.

Figure 17. Disparity map (UBC).

Figure 18. Contour map (UBC) based on matching before consistency check.

In the Phoenix images, the contour map of Figure 14 is also generated from matching data
without a consistency check. In Figure 15, we apply the surface reconstruction algorithm to the
data after applying the consistency check. We also have relaxed the tightness with which the
reconstruction is forced to pass through the stereo data. It can be seen that the resulting contour
map has removed the obvious matching defects and has a smoother set of contours. This smoother
surface reconstruction is one means of removing possible localization errors in the matched data,
as well as matching errors that survive the process.

While the Fort Sill image presents a great deal of difficulty to the algorithm due to large
fluctuations in the positions and shapes of the zero-crossing contours, the Boeing image presents a
different type of difficulty. Here, the large number of extended, parallel image contours presents
a large set of potential ambiguities. In general, however, the algorithm is able to solve this
problem, by relying on information from coarser channels to disambiguate finer ones. Because
the interpolation process is only applicable across smooth surfaces, and the Boeing image contains
a large number of surface discontinuities, we have omitted the contour map for this image.

It is important to stress with all of the contour maps, and especially for the UBC images,
that these illustrations are intended as a graphical means of displaying the performance of the
stereo algorithm but not as a precise reconstruction of the underlying terrain. In particular, since
one of the parameters of the surface reconstruction algorithm is the degree of smoothing applied
to the reconstructed surface, the resulting contour maps may exhibit more smoothing than is
warranted, due to the choice of this parameter. Nonetheless the qualitative performance of the
stereo algorithm is still evident by the arrangement and spacing of the contours. In the case of the
stereo pairs with buildings and other artifacts present, the application of the surface reconstruction
algorithm directly to the results of the stereo algorithm is actually incorrect, since it attempts to fit
a single surface over what are in fact several distinct surfaces. To be completely correct, the stereo
depth data should be segmented into coherent regions, and then interpolated. Since this was not
done, the resulting surface interpolation tends incorrectly to smooth over the discontinuities in
depth. Nonetheless, the contour maps illustrated still demonstrate the basic performance of the
stereo algorithm and the tightly clustered isometric contours help to indicate the separations of
the different buildings from the ground.

5.  Discussion 

The modified Marr-Poggio-Grimson algorithm presented here was originally implemented in
LISP on an MIT Lisp Machine, and then recoded in Lisp Machine microcode, for more efficient
performance. The convolutions of the images were performed using a special purpose convolution
device [Nishihara and Larson, 1981]. While the time required to process an image is dependent
on a large number of factors involving the complexity of the image, it is possible to give estimates
on the performance of this implementation of the algorithm. Using a 320 x 320 image as a basis,
we have observed the following timing characteristics. Each convolution of an image, including
time required to interface the convolution device with the Lisp Machine, usually required on the
order of 6 seconds. Each computation of a zero-crossings representation typically required on the
order of [illegible] seconds. The amount of time required to match the zero-crossing representations was
highly dependent on the number of fixation positions required (and thus on the total disparity
range of the image). Matching at each such fixation position usually required on the order of
[illegible] seconds, depending on the structure of the zero-crossing contours. Finally, combining all
the slices of the disparity map into a single consistent representation typically required on the
order of [illegible] seconds. Thus, for example, a single fine resolution channel processing of the
UBC images normally took under [illegible] minutes in total, and the total time for running three different
resolution channels was on the order of 10 minutes.